image.png

Recipe Diet Classification with NLP¶


Table of Contents¶

  1. Problem Statement Formulation and Definition
    • Motivation
    • Problem Statement
    • Expected Results
  2. Selection of an appropriate data set (Data Collection)
    • Data Selection and Justification
    • Data Visualization and Exploratory Data Analysis
    • Data Preprocessing
  3. Text Preprocessing
    • Lowercasing
    • Special Characters and Numbers Removal
    • Tokenization
    • Stopwords Removal
    • Lemmatization
    • Preprocessed Text Visualization
  4. Text Representation
    • Bag of Words (BoW)
    • Term Frequency - Inverse Document Frequency (TF-IDF)
    • Word Embeddings with Word2Vec
  5. Text Classification / Prediction
    • Text Classification with Bag of Words Vectors
    • Text Classification with TF-IDF
  6. Evaluation, Inferences, Recommendations and Reflection
    • Evaluation
    • Inferences
    • Recommendations and Reflection
  7. Log Files
  8. Github Link
  9. References

1. Problem Statement Formulation and Definition¶

Motivation¶

The motivation behind this project stems from the interest in understanding the relationship between the ingredients and diet of a recipe from a calories and protein intake perspective. The availability of an ingredient-based recipe diet categorization can help in identifying if the recipe or dish is suitable for based on the "calories" and "protein" diet categorization.

Problem Statement¶

This project aims to develop an ingredient-based recipe diet classification model that identify if a recipe is low calorie, high protein, or other based on the ingredients.

Expected Results¶

The developed NLP model is expected to be able to properly identify if the recipe is a low calorie, high protein, or other diet recipe based on the given ingredients as it should be able to properly process the data of the recipes diet category and ingredients.

2. Selection of an appropriate data set (Data Collection)¶

Data Selection and Justification¶

Data Source: DOI

The dataset used for this project is the "Food.com Recipes and Interactions" dataset from Kaggle (Li, 2019). It contains a rich collection of recipe records extracted from a recipes website which is Food.com. Moreover, the records of this dataset is highly relatable to the project objective of identifying recipe diet based on ingredients as each record contains the required information which are the recipe ingredients that the NLP model will mainly depend on, in addition to the recipe nutrition information including the calories and protein which will be used to create the diet category based on its values.

Data Visualization and Exploratory Data Analysis¶

Project Imports

In [ ]:
# for data
import pandas as pd
import ast
from collections import Counter
import numpy as np

# for visualization
import plotly.express as px
import plotly.io as pio
import plotly.subplots as sp
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# for text processing
import re
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# for encoding categorical labels
from sklearn.preprocessing import LabelEncoder

# for text representation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# for model development
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


# for model evaluation
import scikitplot as skplt
from sklearn.metrics import classification_report


nltk.download('punkt') # for tokenization
nltk.download('stopwords')
nltk.download('wordnet') # for lemmatization

# to ensure plotly graphs are exported
pio.renderers.default = "plotly_mimetype+notebook" 
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alaa2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alaa2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\alaa2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Link to Dataset: https://drive.google.com/file/d/1pAIYQuIbaHVpr_v7cmur723GGSxyusJ-/view?usp=sharing

In [ ]:
# read the dataset and save it into a pandas dataframe (df)
data = pd.read_csv("data/recipes/RAW_recipes.csv")
In [ ]:
# display first 5 rows of the data
data.head()
Out[ ]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
0 arriba baked winter squash mexican style 137739 55 47892 2005-09-16 ['60-minutes-or-less', 'time-to-make', 'course... [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] 11 ['make a choice and proceed with recipe', 'dep... autumn is my favorite time of year to cook! th... ['winter squash', 'mexican seasoning', 'mixed ... 7
1 a bit different breakfast pizza 31490 30 26278 2002-06-17 ['30-minutes-or-less', 'time-to-make', 'course... [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] 9 ['preheat oven to 425 degrees f', 'press dough... this recipe calls for the crust to be prebaked... ['prepared pizza crust', 'sausage patty', 'egg... 6
2 all in the kitchen chili 112140 130 196586 2005-02-25 ['time-to-make', 'course', 'preparation', 'mai... [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] 6 ['brown ground beef in large pot', 'add choppe... this modified version of 'mom's' chili was a h... ['ground beef', 'yellow onions', 'diced tomato... 13
3 alouette potatoes 59389 45 68585 2003-04-14 ['60-minutes-or-less', 'time-to-make', 'course... [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] 11 ['place potatoes in a large pot of lightly sal... this is a super easy, great tasting, make ahea... ['spreadable cheese with garlic and herbs', 'n... 11
4 amish tomato ketchup for canning 44061 190 41706 2002-10-25 ['weeknight', 'time-to-make', 'course', 'main-... [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] 5 ['mix all ingredients& boil for 2 1 / 2 hours ... my dh's amish mother raised him on this recipe... ['tomato juice', 'apple cider vinegar', 'sugar... 8

Exploratory Data Analysis (EDA)¶

In [ ]:
# data summary
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   name            231636 non-null  object
 1   id              231637 non-null  int64 
 2   minutes         231637 non-null  int64 
 3   contributor_id  231637 non-null  int64 
 4   submitted       231637 non-null  object
 5   tags            231637 non-null  object
 6   nutrition       231637 non-null  object
 7   n_steps         231637 non-null  int64 
 8   steps           231637 non-null  object
 9   description     226658 non-null  object
 10  ingredients     231637 non-null  object
 11  n_ingredients   231637 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 21.2+ MB
Dataset Summary¶

There are 231637 records and 12 columns in the dataset with 5 columns containing numerical data and 7 columns containing objects as can be seen in the dataframe summary above which contains the columns names as well.

Statistics for numeric values¶
In [ ]:
stats = data[["minutes", "n_steps", "n_ingredients"]].describe()
stats.astype(int)
Out[ ]:
minutes n_steps n_ingredients
count 231637 231637 231637
mean 9398 9 9
std 4461963 5 3
min 0 0 1
25% 20 6 6
50% 40 9 9
75% 65 12 11
max 2147483647 145 43

Data Visualization¶

In [ ]:
ingredients_fig = px.histogram(
    data,
    x="n_ingredients",
    title="Distribution of Recipe's number of ingredients",
    labels={"n_ingredients": "Number of Ingredients", "count": "Count"},
    marginal="box",
)
ingredients_fig.update_layout(bargap=0.2)
ingredients_fig.show()
In [ ]:
steps_fig = px.histogram(
    data,
    x="n_steps",
    title="Distribution of Recipe's number of preparation steps",
    labels={"n_steps": "Number of Preparation Steps", "count": "Count"},
    marginal="box",
)
steps_fig.update_layout(bargap=0.2)
steps_fig.show()
In [ ]:
time_fig = px.histogram(
    data,
    x="minutes",
    title="Distribution of Recipe's preparation time",
    labels={"minutes": "Preparation Time", "count": "Count"},
    marginal="box",
)
time_fig.update_layout(bargap=0.2)
time_fig.show()

The visualization is not clear due to the huge values of the outliers.

Based on the box plot the upper fence of the recipes preparation time is 132. Therefore, a subset of the preparation time column is created with an upper limit of 132 to get a more clear visualization.

In [ ]:
# create a copy of the dataframe with only preparation time less than or equal to 132
df_filtered_minutes = data[data["minutes"] <= 132]

A more clear visualization of the recipes preparation time.

In [ ]:
time_fig3 = px.histogram(
    df_filtered_minutes,
    x="minutes",
    title="Distribution of Recipe's preparation time (recipes with preparation time <= 132)",
    labels={"minutes": "Preparation Time", "count": "Count"},
    marginal="box",
)
time_fig3.update_layout(bargap=0.2)
time_fig3.show()

It is difficult to determine wether all the outliers in the analyzed 3 numerical columns are correct or wrong. Moreover, these columns are of not much significance for the case of identifying the diet category based on the ingredients which will mostly focus on the text and names of the ingredients. Therefore, these columns can be dropped from the dataset.

Data Preprocessing¶

The first step would be dropping the unrelated columns identified during data visualization

In [ ]:
# drop n_steps, n_ingredients, minutes columns
data.drop(
    columns=["n_steps", "n_ingredients", "minutes"],
    inplace=True,
)

Based on the dataset summary, two columns which are the name and description columns have null values.

In [ ]:
# checking null values
data.isnull().sum()
Out[ ]:
name                 1
id                   0
contributor_id       0
submitted            0
tags                 0
nutrition            0
steps                0
description       4979
ingredients          0
dtype: int64
  • 1 record is missing a name
  • 4979 records are missing a description
In [ ]:
# checking the values in description column
data["description"].head().values
Out[ ]:
array(['autumn is my favorite time of year to cook! this recipe \r\ncan be prepared either spicy or sweet, your choice!\r\ntwo of my posted mexican-inspired seasoning mix recipes are offered as suggestions.',
       'this recipe calls for the crust to be prebaked a bit before adding ingredients. feel free to change sausage to ham or bacon. this warms well in the microwave for those late risers.',
       "this modified version of 'mom's' chili was a hit at our 2004 christmas party. we made an extra large pot to have some left to freeze but it never made it to the freezer. it was a favorite by all. perfect for any cold and rainy day. you won't find this one in a cookbook. it is truly an original.",
       'this is a super easy, great tasting, make ahead side dish that looks like you spent a lot more time preparing than you actually do. plus, most everything is done in advance. the times do not reflect the standing time of the potatoes.',
       "my dh's amish mother raised him on this recipe. he much prefers it over store-bought ketchup. it was a taste i had to acquire, but now my ds's also prefer this type of ketchup. enjoy!"],
      dtype=object)

The number of missing values in the description column is very large and based on the available values, this column contains some descriptions or general information written by the uploader of the recipe. Therefore, it is identified that this column is of no relevance and can be dropped from the dataset.

In [ ]:
# drop description column
data.drop(
    columns=["description"],
    inplace=True,
)
In [ ]:
# checking the row with missing recipe name
missing_name_idx = data[data["name"].isnull()].index
data.loc[missing_name_idx].values
Out[ ]:
array([[nan, 368257, 779451, '2009-04-27',
        "['15-minutes-or-less', 'time-to-make', 'course', 'preparation', 'low-protein', 'salads', 'easy', 'salad-dressings', 'dietary', 'low-sodium', 'inexpensive', 'low-in-something', '3-steps-or-less']",
        '[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]',
        "['in a bowl , combine ingredients except for olive oil', 'slowly whisk inches', 'olive oil until thickened', 'great with field greens', 'makes about 2 / 3', 'cup dressing']",
        "['lemon', 'honey', 'horseradish mustard', 'garlic clove', 'dried parsley', 'dried basil', 'dried thyme', 'garlic salt', 'black pepper', 'olive oil']"]],
      dtype=object)

Regarding the missing value in the name column, the values of the record containing the missing name suggest that its a recipe of a salad dressing. This can be inferred mainly from the tags associated with the recipe and the ingredients. Accordingly, the null value can be replaced with "Salad Dressing" as a name.

In the case where multiple rows have a missing name, a possible solution for handling the null values could be through assigning a random name generated from the recipe tags.

In [ ]:
# assign the selected name in place of the missing name
data.loc[missing_name_idx, "name"] = "Salad Dressing"

# view the updates on the row
data.loc[missing_name_idx].values
Out[ ]:
array([['Salad Dressing', 368257, 779451, '2009-04-27',
        "['15-minutes-or-less', 'time-to-make', 'course', 'preparation', 'low-protein', 'salads', 'easy', 'salad-dressings', 'dietary', 'low-sodium', 'inexpensive', 'low-in-something', '3-steps-or-less']",
        '[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]',
        "['in a bowl , combine ingredients except for olive oil', 'slowly whisk inches', 'olive oil until thickened', 'great with field greens', 'makes about 2 / 3', 'cup dressing']",
        "['lemon', 'honey', 'horseradish mustard', 'garlic clove', 'dried parsley', 'dried basil', 'dried thyme', 'garlic salt', 'black pepper', 'olive oil']"]],
      dtype=object)

Identification of unrelated / irrelevant columns¶

In [ ]:
# viewing the first 5 rows of the dataframe to check the values
data.head()
Out[ ]:
name id contributor_id submitted tags nutrition steps ingredients
0 arriba baked winter squash mexican style 137739 47892 2005-09-16 ['60-minutes-or-less', 'time-to-make', 'course... [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] ['make a choice and proceed with recipe', 'dep... ['winter squash', 'mexican seasoning', 'mixed ...
1 a bit different breakfast pizza 31490 26278 2002-06-17 ['30-minutes-or-less', 'time-to-make', 'course... [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] ['preheat oven to 425 degrees f', 'press dough... ['prepared pizza crust', 'sausage patty', 'egg...
2 all in the kitchen chili 112140 196586 2005-02-25 ['time-to-make', 'course', 'preparation', 'mai... [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] ['brown ground beef in large pot', 'add choppe... ['ground beef', 'yellow onions', 'diced tomato...
3 alouette potatoes 59389 68585 2003-04-14 ['60-minutes-or-less', 'time-to-make', 'course... [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] ['place potatoes in a large pot of lightly sal... ['spreadable cheese with garlic and herbs', 'n...
4 amish tomato ketchup for canning 44061 41706 2002-10-25 ['weeknight', 'time-to-make', 'course', 'main-... [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] ['mix all ingredients& boil for 2 1 / 2 hours ... ['tomato juice', 'apple cider vinegar', 'sugar...

The columns 'id', 'contributor_id', and 'submitted' are irrelevant to the project's use case and the NLP model development as they are information related to the submission of the recipes in the website that the data was retrieved from. Hence, these columns can be dropped, in addition to the 'tags', 'name', and 'steps' column which are not required for this use case.

In [ ]:
# drop id, contributor_id, submitted, n_steps, n_ingredients, tags columns
data.drop(
    columns=["id", "contributor_id", "submitted", "tags", "name", "steps"],
    inplace=True,
)

Further processing¶

The nutrition column requires preprocessing to infer the required calories information from it and assign the proper categories for each record. Based on the data card details in kaggle where the dataset was retrieved from (Li, 2019), this column contains the nutrition information as in calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV) consecutively. Therefore, the nutrition column can be split into multiple columns to represent each value separately.

In [ ]:
# ensure that the nutrition column contains lists of floating values instead of a string
# using ast "Abstract Syntax Trees"
data["nutrition"] = data["nutrition"].apply(ast.literal_eval)

# create the new columns and assign the values to them
data[
    [
        "calories (#)",
        "total fat (PDV)",
        "sugar (PDV)",
        "sodium (PDV)",
        "protein (PDV)",
        "saturated fat (PDV)",
        "carbohydrates (PDV)",
    ]
] = data["nutrition"].to_list()

# drop the nutrition column as it's of no more use
data.drop(columns=["nutrition"], inplace=True)
In [ ]:
# update ingredients to ensure that they are lists of strings not one string objects
data["ingredients"] = data["ingredients"].apply(ast.literal_eval)
In [ ]:
# view updated dataframe
data.head()
Out[ ]:
ingredients calories (#) total fat (PDV) sugar (PDV) sodium (PDV) protein (PDV) saturated fat (PDV) carbohydrates (PDV)
0 [winter squash, mexican seasoning, mixed spice... 51.5 0.0 13.0 0.0 2.0 0.0 4.0
1 [prepared pizza crust, sausage patty, eggs, mi... 173.4 18.0 0.0 17.0 22.0 35.0 1.0
2 [ground beef, yellow onions, diced tomatoes, t... 269.8 22.0 32.0 48.0 39.0 27.0 5.0
3 [spreadable cheese with garlic and herbs, new ... 368.1 17.0 10.0 2.0 14.0 8.0 20.0
4 [tomato juice, apple cider vinegar, sugar, sal... 352.9 1.0 337.0 23.0 3.0 0.0 28.0
In [ ]:
# checking calories and protein columns statistics
data[["calories (#)", "protein (PDV)"]].describe()
Out[ ]:
calories (#) protein (PDV)
count 231637.000000 231637.00000
mean 473.942425 34.68186
std 1189.711374 58.47248
min 0.000000 0.00000
25% 174.400000 7.00000
50% 313.400000 18.00000
75% 519.700000 51.00000
max 434360.200000 6552.00000

The very large maximum value is mostly the effect of inconsistent units. As some entries seem to have the records in kilocalories(kcal), which seems to be the majority based on the statistics, while others are in calories(cal). Therefore, standardizing the units to kilocalories is required to have accurate diet categories.

In [ ]:
# One kilocalorie is equivalent to 1000 calories.
data["calories (kcal)"] = data["calories (#)"].apply(lambda x: x / 1000 if x > 1000 else x)

As there is a minimum value of 0 for the calories and protein, this might be due to missing input. Therefore, records with 0 for both calories and protein will be dropped to ensure data quality.

In [ ]:
data = data[(data["calories (kcal)"] > 0) & (data["protein (PDV)"] > 0)]

Calories visualization to analyze the updated statistics and distribution

In [ ]:
calories_fig = px.histogram(
    data,
    x="calories (kcal)",
    title="Distribution of recipe calories intake in kilocalories",
    labels={"calories (kcal)": "calories (kcal)", "count": "Count"},
    marginal="box",
)
calories_fig.update_layout(bargap=0.2)
calories_fig.show()

The current distribution is more balanced as all the values are in the same unit which is kilocalories.

Labels Identification¶

In [ ]:
# create a new column with empty strings
data["diet"] = ""

The calorie threshold is set to 100 based on the information from "Reading Food Nutrition Labels" about food nutrition values (Reading Food Nutrition Labels). While the protein threshold is set to 30, as the recommended protein intake per serving is between 15 - 30 (Wempen, 2022).

In [ ]:
def categorize_diet(row):
    # Define the standards for each nutritional value
    calorie_threshold = 100
    protein_threshold = 30

    # Categorize based on "calories (#)" and "protein (PDV)"
    if row["calories (#)"] < calorie_threshold:
        return "Low-Calorie Diet"
    elif row["protein (PDV)"] > protein_threshold:
        return "High-Protein Diet"
    else:
        return "Other Diet"

# Apply the categorize_diet function to create a new column "Diet Type"
# data["Diet Type"] = data.apply(categorize_diet, axis=1)
data["diet"] = data.apply(categorize_diet, axis=1)
In [ ]:
# Create the bar plot
fig = px.histogram(x=data['diet'], labels={'x':'Category'}, title='Categories Distribution')
fig.show()

Most of the recipes are of other diet, followed by high-protein diet and low-calorie diet being the least.

In order to balance the data, a subset of the dataset is taken with a similar number of records from each category.

In [ ]:
# use the minimum count of the diet category as the number of samples

min_count = data['diet'].value_counts().min()
n_samples = min_count if min_count<=10000 else 10000

# groupby 'diet' and take same amount of samples from each group
df = data.groupby('diet', group_keys=False).apply(lambda x: x.sample(min(len(x), n_samples), random_state=42))

# shuffle the rows in the dataframe to ensure that the similar category values are not grouped together
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Reset the index
df.reset_index(drop=True, inplace=True)
In [ ]:
# Create the bar plot
fig = px.histogram(x=df['diet'], labels={'x':'Category'}, title='Categories Distribution')
fig.show()
In [ ]:
# Instantiate the encoder
le = LabelEncoder()

# Fit and transform the diet labels
df["label"] = le.fit_transform(df['diet'])
In [ ]:
# dropping nutrition columns
df.drop(
    columns=[
        "calories (#)",
        "total fat (PDV)",
        "sugar (PDV)",
        "sodium (PDV)",
        "protein (PDV)",
        "saturated fat (PDV)",
        "carbohydrates (PDV)",
        "calories (kcal)",
    ],
    inplace=True,
)

Processed data summary and information¶

In [ ]:
# checking the data summary after the processing
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ingredients  30000 non-null  object
 1   diet         30000 non-null  object
 2   label        30000 non-null  int32 
dtypes: int32(1), object(2)
memory usage: 586.1+ KB

The processed data has 30000 records with 3 columns instead of the initial 12 columns as most of the initial columns were dropped except the ingredients. The other two columns are diet which is the categorical label, and label which is the encoded label.

Processed Data Visualization and Analysis¶

The analysis in this part focuses on the ingredients column as it is the column that the NLP model will depend on.

In [ ]:
# flatten all ingredients values into a 1D list
flatten_ingredients = [ingredient for ingredients in df['ingredients'].tolist() for ingredient in ingredients]

# Count the frequency of each ingredient
ingredient_counts = Counter(flatten_ingredients)

# Get the top 30 most common ingredients
top_ingredients = ingredient_counts.most_common(30)
steps, counts = zip(*top_ingredients)

# Create the bar plot
fig = px.bar(x=steps, y=counts, labels={'x':'Ingredients', 'y':'Counts'}, title='Top 30 Ingredient Frequencies')
fig.show()

The previous bar graph shows that salt is the most frequent ingredient in the dataset, followed by butter and sugar.

In [ ]:
print(
    "There are",
    len(ingredient_counts),
    "unique ingredients before processing the ingredients text.",
)
There are 8511 unique ingredients before processing the ingredients text.
In [ ]:
def plot_cloud(count: Counter, title: str) -> None:
    """
    This function plots a wordcloud
    based on the ingredients counts in the dataframe

    Parameters
    ----------
    cont : Counter
        The ingredients frequencies counter
    """
    wordcloud = WordCloud(
        width=1000,
        height=500,
        random_state=1,
        background_color="black",
        colormap="Pastel1",
        collocations=False,
        stopwords=STOPWORDS,
    ).generate_from_frequencies(count)


    plt.figure(figsize=(10, 5))
    plt.title(title)
    plt.imshow(wordcloud)
    plt.axis("off")
In [ ]:
plot_cloud(ingredient_counts, "Unprocessed Ingredients")
No description has been provided for this image

The wordcloud shows that salt, butter, sugar, water, eggs, and onion are the most frequent ingredients as identified previously in the bar graph.

3. Text Preprocessing¶

Based on the analysis of the ingredients values, there are stopwords, numbers and special characters, in addition to plurals. Therefore, the main text preprocessing techniques required will be special characters, numbers and stopwords removal, tokenization, and lemmatization.

In [ ]:
# create a new column in the dataframe to contain preprocessed ingredients 
# and initialize with a copy of ingredients
df["pp ingredients"] = df["ingredients"].copy()
In [ ]:
# display the new column for indexes 360 to 375
pd.set_option('display.max_colwidth', None) # to display the column in full width
df.loc[360:375, "pp ingredients"]
Out[ ]:
360                                                                                           [king prawns, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic cloves, fresh chives]
361                                                                                                    [wonton wrappers, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt and pepper]
362                                                                                                               [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt and pepper]
363                                                                                                                                              [old fashioned oats, water, cinnamon, dried fruits, spices]
364                                         [pork shoulder, olive oil, onion, tomatoes, garlic cloves, cumin powder, fresh oregano, whole cloves, bay leaves, dried chipotle chiles, water, salt and pepper]
365                                                                                                                                             [ground beef, ricotta cheese, basil pesto, egg, pasta sauce]
366    [bacon, fresh mushrooms, garlic, olive oil, butter, yellow onions, fresh green beans, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crusts, egg, half-and-half]
367                                                                                              [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg whites, water, chocolate chips]
368                                                                           [graham wafer crumbs, almonds, sugar, butter, cream cheese, flour, eggs, sour cream, amaretto di saronno liqueur, apricot jam]
369                                                  [ground round, lean ground turkey, water, diced tomatoes, celery, onion, carrots, beef bouillon cubes, potatoes, mrs. dash seasoning mix, pepper, salt]
370                                                                                                                       [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt and pepper]
371                                                     [chicken breasts, sun-dried tomatoes, frozen spinach, goat cheese, pine nuts, shallot, white wine, chicken broth, heavy cream, salt and pepper, oil]
372                                                                                                     [ham, light mayonnaise, walnuts, dijon mustard, curry powder, english cucumber, yellow sweet pepper]
373                                                                         [zucchini, green onions, feta cheese, basil leaves, eggs, salt & fresh ground pepper, plain flour, baking powder, vegetable oil]
374                                             [butter, brown sugar, eggs, baking soda, all-purpose flour, ground cinnamon, ground nutmeg, ground cloves, ground ginger, salt, pecans, candied citron peel]
375                                                                                                                 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery]
Name: pp ingredients, dtype: object

Lowercasing¶

The first preprocessing technique applied on the text is lowercasing to ensure that all ingredients are in lowercase and to prevent having different cases of the words which can be interpreted as different words by the model.

In [ ]:
def lowercase_ingredients(ingredients: list[str]) -> list[str]:
    """
    This function lowercases all the ingredients

    Parameters
    ----------
    ingredients : list[str]
        a list of ingredients to be processed

    Returns
    -------
    list[str]
        a list of lowercased ingredients
    """
    ingredients = [ingredient.lower() for ingredient in ingredients]
    return ingredients
In [ ]:
df["pp ingredients"] = df["pp ingredients"].apply(lowercase_ingredients)
In [ ]:
# display the preprocessed ingredients column for indexes 360 to 375
# after applying lowercase
df.loc[360:375, "pp ingredients"]
Out[ ]:
360                                                                                           [king prawns, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic cloves, fresh chives]
361                                                                                                    [wonton wrappers, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt and pepper]
362                                                                                                               [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt and pepper]
363                                                                                                                                              [old fashioned oats, water, cinnamon, dried fruits, spices]
364                                         [pork shoulder, olive oil, onion, tomatoes, garlic cloves, cumin powder, fresh oregano, whole cloves, bay leaves, dried chipotle chiles, water, salt and pepper]
365                                                                                                                                             [ground beef, ricotta cheese, basil pesto, egg, pasta sauce]
366    [bacon, fresh mushrooms, garlic, olive oil, butter, yellow onions, fresh green beans, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crusts, egg, half-and-half]
367                                                                                              [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg whites, water, chocolate chips]
368                                                                           [graham wafer crumbs, almonds, sugar, butter, cream cheese, flour, eggs, sour cream, amaretto di saronno liqueur, apricot jam]
369                                                  [ground round, lean ground turkey, water, diced tomatoes, celery, onion, carrots, beef bouillon cubes, potatoes, mrs. dash seasoning mix, pepper, salt]
370                                                                                                                       [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt and pepper]
371                                                     [chicken breasts, sun-dried tomatoes, frozen spinach, goat cheese, pine nuts, shallot, white wine, chicken broth, heavy cream, salt and pepper, oil]
372                                                                                                     [ham, light mayonnaise, walnuts, dijon mustard, curry powder, english cucumber, yellow sweet pepper]
373                                                                         [zucchini, green onions, feta cheese, basil leaves, eggs, salt & fresh ground pepper, plain flour, baking powder, vegetable oil]
374                                             [butter, brown sugar, eggs, baking soda, all-purpose flour, ground cinnamon, ground nutmeg, ground cloves, ground ginger, salt, pecans, candied citron peel]
375                                                                                                                 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery]
Name: pp ingredients, dtype: object

A quick look on the initial data shows that the ingredients are lowercased as it can be seen that there is no difference between the previous printed output of this subset of the data and the lowercased output. However, this step is to ensure that all the ingredients are lowercased.

Special Characters and Numbers Removal¶

This step includes special characters and numbers removal as some ingredients names may contain special characters especially the hyphen "-" and percentage "%" as they can be seen frequently in the ingredients names like "10% low-fat milk" where this ingredient might also be available as "low fat milk" in another recipe.

In [ ]:
def clean_text(ingredients: list[str]) -> list[str]:
    """
    This function cleans the ingredients from any special characters and numbers

    Parameters
    ----------
    ingredients : list[str]
        a list of ingredients to be processed

    Returns
    -------
    list[str]
        a list of ingredients after special characters and numbers removal
    """
    ingredients = [
        re.sub("[^A-Za-z ]+", " ", ingredient) for ingredient in ingredients
    ]
    return ingredients
In [ ]:
df["pp ingredients"] = df["pp ingredients"].apply(clean_text)
In [ ]:
# display the preprocessed ingredients column for indexes 360 to 375
# after applying special characters and numbers removal
df.loc[360:375, "pp ingredients"]
Out[ ]:
360                                                                                           [king prawns, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic cloves, fresh chives]
361                                                                                                    [wonton wrappers, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt and pepper]
362                                                                                                               [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt and pepper]
363                                                                                                                                              [old fashioned oats, water, cinnamon, dried fruits, spices]
364                                         [pork shoulder, olive oil, onion, tomatoes, garlic cloves, cumin powder, fresh oregano, whole cloves, bay leaves, dried chipotle chiles, water, salt and pepper]
365                                                                                                                                             [ground beef, ricotta cheese, basil pesto, egg, pasta sauce]
366    [bacon, fresh mushrooms, garlic, olive oil, butter, yellow onions, fresh green beans, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crusts, egg, half and half]
367                                                                                              [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg whites, water, chocolate chips]
368                                                                           [graham wafer crumbs, almonds, sugar, butter, cream cheese, flour, eggs, sour cream, amaretto di saronno liqueur, apricot jam]
369                                                  [ground round, lean ground turkey, water, diced tomatoes, celery, onion, carrots, beef bouillon cubes, potatoes, mrs  dash seasoning mix, pepper, salt]
370                                                                                                                       [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt and pepper]
371                                                     [chicken breasts, sun dried tomatoes, frozen spinach, goat cheese, pine nuts, shallot, white wine, chicken broth, heavy cream, salt and pepper, oil]
372                                                                                                     [ham, light mayonnaise, walnuts, dijon mustard, curry powder, english cucumber, yellow sweet pepper]
373                                                                         [zucchini, green onions, feta cheese, basil leaves, eggs, salt   fresh ground pepper, plain flour, baking powder, vegetable oil]
374                                             [butter, brown sugar, eggs, baking soda, all purpose flour, ground cinnamon, ground nutmeg, ground cloves, ground ginger, salt, pecans, candied citron peel]
375                                                                                                                 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery]
Name: pp ingredients, dtype: object

It can be seen that the hyphens, percentages, and numbers that were present in the data previously are now removed.

Tokenization¶

The third preprocessing technique is tokenization where each ingredient will be split into separate words instead of phrases.

In [ ]:
def tokenize_ingredients(ingredients: list[str]) -> list[list[str]]:
    """
    This function tokenizes each ingredient in the ingredients list

    Parameters
    ----------
    ingredients : list[str]
        a list of ingredients to be processed

    Returns
    -------
    list[list[str]]
        a list of tokenized ingredients as lists of tokens
    """
    ingredients = [word_tokenize(ingredient) for ingredient in ingredients]
    return ingredients
In [ ]:
df["pp ingredients"] = df["pp ingredients"].apply(tokenize_ingredients)
In [ ]:
# display the preprocessed ingredients column for indexes 360 to 375
# after applying tokenization
df.loc[360:375, "pp ingredients"]
Out[ ]:
360                                                                                                                  [[king, prawns], [fresh, parsley, sprig], [olive, oil], [barbecue, sauce], [lemon, juice], [honey], [garlic, cloves], [fresh, chives]]
361                                                                                                                               [[wonton, wrappers], [ricotta, cheese], [frozen, chopped, spinach], [egg, yolk], [parmesan, cheese], [salt, and, pepper]]
362                                                                                                                                         [[carrot], [olive, oil], [balsamic, vinegar], [honey], [chili, powder], [cumin], [ginger], [salt, and, pepper]]
363                                                                                                                                                                                [[old, fashioned, oats], [water], [cinnamon], [dried, fruits], [spices]]
364                                                     [[pork, shoulder], [olive, oil], [onion], [tomatoes], [garlic, cloves], [cumin, powder], [fresh, oregano], [whole, cloves], [bay, leaves], [dried, chipotle, chiles], [water], [salt, and, pepper]]
365                                                                                                                                                                              [[ground, beef], [ricotta, cheese], [basil, pesto], [egg], [pasta, sauce]]
366    [[bacon], [fresh, mushrooms], [garlic], [olive, oil], [butter], [yellow, onions], [fresh, green, beans], [fresh, thyme], [ground, cumin], [green, onion], [salt], [pepper], [cream, cheese], [whole, milk], [pie, crusts], [egg], [half, and, half]]
367                                                                                                                    [[flour], [baking, soda], [salt], [dark, brown, sugar], [sugar], [margarine], [vanilla], [egg, whites], [water], [chocolate, chips]]
368                                                                                              [[graham, wafer, crumbs], [almonds], [sugar], [butter], [cream, cheese], [flour], [eggs], [sour, cream], [amaretto, di, saronno, liqueur], [apricot, jam]]
369                                                                 [[ground, round], [lean, ground, turkey], [water], [diced, tomatoes], [celery], [onion], [carrots], [beef, bouillon, cubes], [potatoes], [mrs, dash, seasoning, mix], [pepper], [salt]]
370                                                                                                                                                    [[fennel, bulb], [olive, oil], [balsamic, vinegar], [dijon, mustard], [garlic], [salt, and, pepper]]
371                                                                   [[chicken, breasts], [sun, dried, tomatoes], [frozen, spinach], [goat, cheese], [pine, nuts], [shallot], [white, wine], [chicken, broth], [heavy, cream], [salt, and, pepper], [oil]]
372                                                                                                                                [[ham], [light, mayonnaise], [walnuts], [dijon, mustard], [curry, powder], [english, cucumber], [yellow, sweet, pepper]]
373                                                                                               [[zucchini], [green, onions], [feta, cheese], [basil, leaves], [eggs], [salt, fresh, ground, pepper], [plain, flour], [baking, powder], [vegetable, oil]]
374                                                          [[butter], [brown, sugar], [eggs], [baking, soda], [all, purpose, flour], [ground, cinnamon], [ground, nutmeg], [ground, cloves], [ground, ginger], [salt], [pecans], [candied, citron, peel]]
375                                                                                                                                              [[ketchup], [brown, sugar], [yellow, mustard], [worcestershire, sauce], [onion], [bell, pepper], [celery]]
Name: pp ingredients, dtype: object

The data became a list of lists (list of ingredients and each ingredient is a list of tokens). Extra spaces are discarded as well.

Stopwords Removal¶

As for stopwords removal, it is used to remove the conjunctions and other stopwords that may be included in the ingredients names. This will help generalize the ingredients more, as two ingredients called "cookies and cream ice cream" and "cookies & cream ice cream" will be the same ingredient after this process.

In [ ]:
def remove_stopwords(ingredients: list[list[str]]) -> list[list[str]]:
    """
    This function removes stopwords from the ingredient lists.

    Parameters
    ----------
    ingredients : list[list[str]]
        a list of tokenized ingredients as lists of tokens

    Returns
    -------
    list[list[str]]
        a list of tokenized ingredients as lists of tokens without stopwords
    """
    
    stop_words = set(stopwords.words("english"))
    ingredients = [
        [word for word in ingredient if word not in stop_words]
        for ingredient in ingredients
    ]
    return ingredients
In [ ]:
df["pp ingredients"] = df["pp ingredients"].apply(remove_stopwords)
In [ ]:
# display the preprocessed ingredients column for indexes 360 to 375
# after applying stopwords removal
df.loc[360:375, "pp ingredients"]
Out[ ]:
360                                                                                                             [[king, prawns], [fresh, parsley, sprig], [olive, oil], [barbecue, sauce], [lemon, juice], [honey], [garlic, cloves], [fresh, chives]]
361                                                                                                                               [[wonton, wrappers], [ricotta, cheese], [frozen, chopped, spinach], [egg, yolk], [parmesan, cheese], [salt, pepper]]
362                                                                                                                                         [[carrot], [olive, oil], [balsamic, vinegar], [honey], [chili, powder], [cumin], [ginger], [salt, pepper]]
363                                                                                                                                                                           [[old, fashioned, oats], [water], [cinnamon], [dried, fruits], [spices]]
364                                                     [[pork, shoulder], [olive, oil], [onion], [tomatoes], [garlic, cloves], [cumin, powder], [fresh, oregano], [whole, cloves], [bay, leaves], [dried, chipotle, chiles], [water], [salt, pepper]]
365                                                                                                                                                                         [[ground, beef], [ricotta, cheese], [basil, pesto], [egg], [pasta, sauce]]
366    [[bacon], [fresh, mushrooms], [garlic], [olive, oil], [butter], [yellow, onions], [fresh, green, beans], [fresh, thyme], [ground, cumin], [green, onion], [salt], [pepper], [cream, cheese], [whole, milk], [pie, crusts], [egg], [half, half]]
367                                                                                                               [[flour], [baking, soda], [salt], [dark, brown, sugar], [sugar], [margarine], [vanilla], [egg, whites], [water], [chocolate, chips]]
368                                                                                         [[graham, wafer, crumbs], [almonds], [sugar], [butter], [cream, cheese], [flour], [eggs], [sour, cream], [amaretto, di, saronno, liqueur], [apricot, jam]]
369                                                            [[ground, round], [lean, ground, turkey], [water], [diced, tomatoes], [celery], [onion], [carrots], [beef, bouillon, cubes], [potatoes], [mrs, dash, seasoning, mix], [pepper], [salt]]
370                                                                                                                                                    [[fennel, bulb], [olive, oil], [balsamic, vinegar], [dijon, mustard], [garlic], [salt, pepper]]
371                                                                   [[chicken, breasts], [sun, dried, tomatoes], [frozen, spinach], [goat, cheese], [pine, nuts], [shallot], [white, wine], [chicken, broth], [heavy, cream], [salt, pepper], [oil]]
372                                                                                                                           [[ham], [light, mayonnaise], [walnuts], [dijon, mustard], [curry, powder], [english, cucumber], [yellow, sweet, pepper]]
373                                                                                          [[zucchini], [green, onions], [feta, cheese], [basil, leaves], [eggs], [salt, fresh, ground, pepper], [plain, flour], [baking, powder], [vegetable, oil]]
374                                                          [[butter], [brown, sugar], [eggs], [baking, soda], [purpose, flour], [ground, cinnamon], [ground, nutmeg], [ground, cloves], [ground, ginger], [salt], [pecans], [candied, citron, peel]]
375                                                                                                                                         [[ketchup], [brown, sugar], [yellow, mustard], [worcestershire, sauce], [onion], [bell, pepper], [celery]]
Name: pp ingredients, dtype: object

The data became clean from stopwords as al stopwords identified in nltk stopwords package are removed.

Lemmatization¶

The last preprocessing technique is lemmatization which is a text normalization technique, it is mainly used to simplify the text and to turn plural words into their singular form. The reason for choosing lemmatization over stemming is mainly due to lemmatization having a higher accuracy than stemming and because it provides actual meaningful words from the dictionary as opposed by stemming that may produce words that are not meaningful (Balodi, 2020).

In [ ]:
def lemmatize_ingredients(ingredients: list[list[str]]) -> list[str]:
    """
    This function applies lemmatization on the ingredients 
    and returns the ingredients into phrases instead of word tokens

    Parameters
    ----------
    ingredients : list[list[str]]
        a list of tokenized ingredients as lists of tokens


    Returns
    -------
    list[str]
        a list of lemmatized and processed ingredients
    """
    lem = WordNetLemmatizer()
    ingredients = [
        [lem.lemmatize(word) for word in ingredient] for ingredient in ingredients
    ]

    ingredients = [" ".join(ingredient) for ingredient in ingredients]
    return ingredients
In [ ]:
df["pp ingredients"] = df["pp ingredients"].apply(lemmatize_ingredients)
In [ ]:
# display the preprocessed ingredients column for indexes 360 to 375
# after applying lemmatization
df.loc[360:375, "pp ingredients"]
Out[ ]:
360                                                                                      [king prawn, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic clove, fresh chive]
361                                                                                                 [wonton wrapper, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt pepper]
362                                                                                                           [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt pepper]
363                                                                                                                                         [old fashioned oat, water, cinnamon, dried fruit, spice]
364                                            [pork shoulder, olive oil, onion, tomato, garlic clove, cumin powder, fresh oregano, whole clove, bay leaf, dried chipotle chile, water, salt pepper]
365                                                                                                                                     [ground beef, ricotta cheese, basil pesto, egg, pasta sauce]
366    [bacon, fresh mushroom, garlic, olive oil, butter, yellow onion, fresh green bean, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crust, egg, half half]
367                                                                                        [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg white, water, chocolate chip]
368                                                                      [graham wafer crumb, almond, sugar, butter, cream cheese, flour, egg, sour cream, amaretto di saronno liqueur, apricot jam]
369                                                  [ground round, lean ground turkey, water, diced tomato, celery, onion, carrot, beef bouillon cube, potato, mr dash seasoning mix, pepper, salt]
370                                                                                                                   [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt pepper]
371                                                     [chicken breast, sun dried tomato, frozen spinach, goat cheese, pine nut, shallot, white wine, chicken broth, heavy cream, salt pepper, oil]
372                                                                                              [ham, light mayonnaise, walnut, dijon mustard, curry powder, english cucumber, yellow sweet pepper]
373                                                                       [zucchini, green onion, feta cheese, basil leaf, egg, salt fresh ground pepper, plain flour, baking powder, vegetable oil]
374                                            [butter, brown sugar, egg, baking soda, purpose flour, ground cinnamon, ground nutmeg, ground clove, ground ginger, salt, pecan, candied citron peel]
375                                                                                                         [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery]
Name: pp ingredients, dtype: object

After lemmatizing the data, all ingredients got lemmatized and became into singular form.

Preprocessed Text Visualization¶

In [ ]:
# flatten all ingredients values into a 1D list
flatten_ingredients = [ingredient for ingredients in df['ingredients'].tolist() for ingredient in ingredients]
flatten_pp_ingredients = [ingredient for ingredients in df['pp ingredients'].tolist() for ingredient in ingredients]

# Count the frequency of the ingredients
ingredient_counts = Counter(flatten_ingredients)
ingredient_pp_counts = Counter(flatten_pp_ingredients)

# Get the top 30 most common ingredients
top_ingredients = ingredient_counts.most_common(30)
steps, counts = zip(*top_ingredients)

top_pp_ingredients = ingredient_pp_counts.most_common(30)
pp_ingredients, pp_counts = zip(*top_pp_ingredients)


fig = sp.make_subplots(rows=2, cols=1, vertical_spacing=0.2)

fig.add_trace(
    go.Bar(x=steps, y=counts, name='Original Ingredients'),
    row=1, col=1
)
fig.add_trace(
    go.Bar(x=pp_ingredients, y=pp_counts, name='Processed Ingredients'),
    row=2, col=1
)

# Create the bar plot
fig.update_layout(title_text="Top 30 Ingredient Frequencies Before and After Processing", height=500)
fig.show()

The bar plots indicate that salt, butter and sugar are the top 3 most frequent ingredients both before and after preprocessing with the same frequency, as for egg, it moved from fifth place before preprocessing to fourth place, this might be mainly because in some ingredients it was identified as egg and in others it was identified as eggs and the preprocessing unified them as one ingredient egg. As for the fifth place, it was taken by onion which moved from the sixth place.

In [ ]:
print(
    "Number of unique ingredients before preprocessing was: ",
    len(ingredient_counts),
    "\nNumber of unique ingredients after preprocessing is: ",
    len(ingredient_pp_counts),
    "\nThis shows that the preprocessing unified a lot of similar ingredients that were previously written with slight changes\nand narrowed the number of different ingredients by",
    len(ingredient_counts) - len(ingredient_pp_counts),
    "decreasing the dimensionality of the data,",
    "\nwhich will help in improving the performance of the ML models"
)
Number of unique ingredients before preprocessing was:  8511 
Number of unique ingredients after preprocessing is:  7717 
This shows that the preprocessing unified a lot of similar ingredients that were previously written with slight changes
and narrowed the number of different ingredients by 794 decreasing the dimensionality of the data, 
which will help in improving the performance of the ML models

4. Text Representation¶

Three different text representation techniques are performed on the preprocessed ingredients to convert them into vectors in preparation for the classification models development.

Bag of Words (BoW)¶

The first technique is bag of words to represent the ingredients as vectors of words frequencies.

In [ ]:
c_vectorizer = CountVectorizer()
bow_ing = c_vectorizer.fit_transform(df["pp ingredients"].apply(' '.join))
In [ ]:
# calculate the term frequency for each term
sum_words = bow_ing.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in c_vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

# create a DataFrame with only the top 30 terms in terms of frequency
bow_top_30 = words_freq[:30]
bow_top_30_df = pd.DataFrame(bow_top_30, columns=['Term', 'Frequency'])

Term Frequency - Inverse Document Frequency (TF-IDF)¶

The second technique is term frequency - inverse document frequency which is an advanced form of bag of words that considers the importance of the term in the entire set.

In [ ]:
# initialize TfidfVectorizer
t_vectorizer = TfidfVectorizer()

# learn the 'vocabulary' of the documents and transform the documents into a document-term matrix
tfidf_ing = t_vectorizer.fit_transform(df["pp ingredients"].apply(' '.join))
In [ ]:
# calculate the term frequency for each term
sum_words = tfidf_ing.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in t_vectorizer.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)

# create a DataFrame with only the top 30 terms in terms of frequency
tfidf_top_30 = words_freq[:30]

tfidf_top_30_df = pd.DataFrame(tfidf_top_30, columns=['Term', 'TF-IDF score'])
In [ ]:
# create bar plots for top 30 frequencies of BoW and TF-IDF
fig = sp.make_subplots(rows=2, cols=1)

fig.add_trace(
    go.Bar(x=bow_top_30_df['Term'], y=bow_top_30_df['Frequency'], name='BoW'),
    row=1, col=1
)
fig.add_trace(
    go.Bar(x=tfidf_top_30_df['Term'], y=tfidf_top_30_df['TF-IDF score'], name='TF-IDF'),
    row=2, col=1
)

# Create the bar plot
fig.update_layout(title_text="Top 30 Words Frequencies with BoW and TF-IDF", height=500)
fig.show()

The bar plots indicate that some words are have a higher frequency in the ingredients data. However, when considering the importance of these frequent words with TF-IDF compared to their frequency, some of them have lower importance such as 'salt' and 'pepper' while others like 'butter' have higher importance.

Word Embeddings with Word2Vec¶

The third technique is word embedding using word2vec to create embeddings for the ingredients that capture the semantic relationship between the words.

In [ ]:
# Train a Word2Vec model
w2v_ing = Word2Vec(df["pp ingredients"], min_count=1)
In [ ]:
# Select a random sample of words
similar_words = w2v_ing.wv.most_similar("chocolate", topn=30)
words = [word for word, _ in similar_words]
similarities = [similarity for _, similarity in similar_words]

vectors = w2v_ing.wv[words]

# Perform PCA
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Create a DataFrame with the PCA results
df2 = pd.DataFrame(result, columns=["PC1", "PC2"])
df2["word"] = words
df2["similarity"] = similarities

# Create a scatter plot using Plotly Express
fig = px.scatter(
    df2,
    x="PC1",
    y="PC2",
    text="word",
    color="similarity",
    hover_data=["word", "similarity"],
    title="Top 30 words similar to the word 'Chocolate'",
)

# Update layout properties
fig.update_traces(textposition="bottom right")

# Show the plot
fig.show()
In [ ]:
def embed_ingredients(ingredients):

    # get a list of each recipe's ingredients vectors
    ingredients_vec = [
        w2v_ing.wv[ingredient]
        for ingredient in ingredients
        if ingredient in w2v_ing.wv
    ]

    # If the recipe had no ingredients in the model's vocabulary, return a zero vector
    if len(ingredients_vec) == 0:
        return np.zeros(w2v_ing.vector_size)
    
    # Otherwise, return the vectors
    return ingredients_vec
In [ ]:
# insert the embeddings in a new column in the dataframe
df['ing_embeddings'] = df['pp ingredients'].apply(embed_ingredients)

5. Text Classification / Prediction¶

The first step before building the classification model is to identify the input and target before splitting the data into train and test sets. For the input (ingredients), they are fed to the models as Bag of Words or TF-IDF vectors.

In [ ]:
# target
y = df["label"]

Text Classification with Bag of Words vectors¶

In [ ]:
X_bow = bow_ing
In [ ]:
X_train_bow, X_test_bow, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

Random Forest Classification¶

In [ ]:
rfc_bow = RandomForestClassifier(random_state=42)
rfc_bow.fit(X_train_bow, y_train)

# Make predictions
y_pred_rfc_bow = rfc_bow.predict(X_test_bow)

Support Vector Machine Classification¶

In [ ]:
svc_bow = SVC()
svc_bow.fit(X_train_bow, y_train)

# Make predictions
y_pred_svc_bow = svc_bow.predict(X_test_bow)

K Nearest Neighbors Classification¶

In [ ]:
knn_bow = KNeighborsClassifier(n_neighbors=15)
knn_bow.fit(X_train_bow, y_train)

# Make predictions
y_pred_knn_bow = knn_bow.predict(X_test_bow)

Text Classification with TF-IDF vectors¶

In [ ]:
# input with TF-IDF vectorization
X_tfidf = tfidf_ing
In [ ]:
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

Random Forest Classification¶

In [ ]:
rfc_tfidf = RandomForestClassifier(random_state=42)
rfc_tfidf.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_rfc_tfidf = rfc_tfidf.predict(X_test_tfidf)

Support Vector Machine Classification¶

In [ ]:
svc_tfidf = SVC()
svc_tfidf.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_svc_tfidf = svc_tfidf.predict(X_test_tfidf)

K Nearest Neighbors Classification¶

In [ ]:
knn_tfidf = KNeighborsClassifier(n_neighbors=15)
knn_tfidf.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_knn_tfidf = knn_tfidf.predict(X_test_tfidf)

6. Evaluation, Inferences, Recommendations and Reflection¶

Evaluation¶

In [ ]:
# Display classification reports for each model
print("Bag of Words Vectorizer")
print("_"*100)

print("\nRandom Forest Classifier:")
print(classification_report(y_test, y_pred_rfc_bow))

print("\nSupport Vector Machine Classifier:")
print(classification_report(y_test, y_pred_svc_bow))

print("K-Nearest Neighbors Classifier:")
print(classification_report(y_test, y_pred_knn_bow))

print("="*100)

print("TF-IDF Vectorizer")
print("_"*100)

print("\nRandom Forest Classifier:")
print(classification_report(y_test, y_pred_rfc_tfidf))

print("\nSupport Vector Machine Classifier:")
print(classification_report(y_test, y_pred_svc_tfidf))

print("K-Nearest Neighbors Classifier:")
print(classification_report(y_test, y_pred_knn_tfidf))
Bag of Words Vectorizer
____________________________________________________________________________________________________

Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.69      0.77      0.73      2009
           1       0.62      0.63      0.62      2023
           2       0.60      0.52      0.56      1968

    accuracy                           0.64      6000
   macro avg       0.64      0.64      0.64      6000
weighted avg       0.64      0.64      0.64      6000


Support Vector Machine Classifier:
              precision    recall  f1-score   support

           0       0.73      0.76      0.74      2009
           1       0.64      0.65      0.65      2023
           2       0.61      0.57      0.59      1968

    accuracy                           0.66      6000
   macro avg       0.66      0.66      0.66      6000
weighted avg       0.66      0.66      0.66      6000

K-Nearest Neighbors Classifier:
              precision    recall  f1-score   support

           0       0.74      0.37      0.49      2009
           1       0.45      0.77      0.57      2023
           2       0.54      0.41      0.47      1968

    accuracy                           0.52      6000
   macro avg       0.58      0.52      0.51      6000
weighted avg       0.58      0.52      0.51      6000

====================================================================================================
TF-IDF Vectorizer
____________________________________________________________________________________________________

Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.67      0.79      0.72      2009
           1       0.63      0.61      0.62      2023
           2       0.60      0.51      0.55      1968

    accuracy                           0.64      6000
   macro avg       0.63      0.63      0.63      6000
weighted avg       0.63      0.64      0.63      6000


Support Vector Machine Classifier:
              precision    recall  f1-score   support

           0       0.74      0.76      0.75      2009
           1       0.65      0.65      0.65      2023
           2       0.61      0.59      0.60      1968

    accuracy                           0.67      6000
   macro avg       0.67      0.67      0.67      6000
weighted avg       0.67      0.67      0.67      6000

K-Nearest Neighbors Classifier:
              precision    recall  f1-score   support

           0       0.69      0.68      0.68      2009
           1       0.57      0.62      0.59      2023
           2       0.56      0.50      0.53      1968

    accuracy                           0.60      6000
   macro avg       0.60      0.60      0.60      6000
weighted avg       0.60      0.60      0.60      6000

In [ ]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

# Row 1
skplt.metrics.plot_confusion_matrix(
    y_test, y_pred_rfc_bow,
    normalize=True,
    title="Random Forest with Bag of Words",
    cmap="Blues",
    ax=axes[0, 0]
)

skplt.metrics.plot_confusion_matrix(
    y_test, y_pred_svc_bow,
    normalize=True,
    title="Support Vector Machine with Bag of Words",
    cmap="Blues",
    ax=axes[0, 1]
)

skplt.metrics.plot_confusion_matrix(
    y_test, y_pred_knn_bow,
    normalize=True,
    title="K Nearest Neighbors with Bag of Words",
    cmap="Blues",
    ax=axes[0, 2]
)

# Row 2
skplt.metrics.plot_confusion_matrix(
    y_test, y_pred_rfc_tfidf,
    normalize=True,
    title="Random Forest with TF-IDF",
    cmap="Purples",
    ax=axes[1, 0]
)

skplt.metrics.plot_confusion_matrix(
    y_test, y_pred_svc_tfidf,
    normalize=True,
    title="Support Vector Machine with TF-IDF",
    cmap="Purples",
    ax=axes[1, 1]
)

skplt.metrics.plot_confusion_matrix(
    y_test, y_pred_knn_tfidf,
    normalize=True,
    title="K Nearest Neighbors with TF-IDF",
    cmap="Purples",
    ax=axes[1, 2]
)

# Adjust layout for better appearance
plt.tight_layout()

# Show the plot
plt.show()
No description has been provided for this image

The Labels:

  • 0 -> High Protein Diet
  • 1 -> Low Calorie Diet
  • 2 -> Other Diet

Inferences¶

Based on the evaluation of the six created models, it can be seen from both the classification reports and the confusion matrices that the support vector machine has the highest accuracy with tf-idf vectors at 67% followed by support vector machine as well but with bag of words at 66% and then Random Forest with both Bag of Words and TF-IDF vectors at 64%.

Multiple Deep Learning models including LSTM and RNN were tested but provided bad results besides other machine learning models such as logistic regression, and Naive Bayes. The same models "Random Forest", "SVC", "KNN" were tested with word2vec embeddings as well but had lower accuracy than Bag of Words and TF-IDF vectors trained models. This might be due to the complexity of word2vec which makes it more suitable with deep learning models.

This inference leads to the conclusion that there might not be a strong relationship between the ingredients and the calories and protein intakes. It is possible that the performance of the models can be improved by including additional textual information related to the diet.

Another inference possibility is that the developed models are not suitable and other models could have performed better.

Recommendations and Reflection¶

As recommendations on the development of this project, it would be good to experiment more with the dataset and try different categories and labels besides the diet such as the cuisine. Moreover, applying neural networks might aid in providing more accurate results but with larger datasets as with the current dataset, both RNN and LSTM models were tested both with tensorflow tokens and with word2vec embeddings but didn't perform well. In addition, the dataset could also be used for a text generation model where the input is the ingredients and the model has to generate the preparation steps for a recipe with these ingredients which is an interesting use case.

As for reflection on the project, the results of the developed models could have been further improved by experimenting with more classification models and models hyper parameters. The use of cross-validation could have been useful as well in improving the performance.


Log files¶

image.png

Github Link¶

https://github.com/aalamohamed/NLP-RD.git


References¶

Balodi, T. (2020, July 14). What is Stemming and Lemmatization in NLP? Retrieved from analyticssteps: https://www.analyticssteps.com/blogs/what-stemming-and-lemmatization-nlp

Li, S. (2019). Food.com Recipes and Interactions, [Data set]. Kaggle. doi: https://doi.org/10.34740/KAGGLE/DSV/783630

Reading Food Nutrition Labels. (n.d.). Retrieved from Washington State Department of Social and Health Services: https://www.dshs.wa.gov/sites/default/files/ALTSA/stakeholders/documents/duals/toolkit/Reading%20Food%20Nutrition%20Labels.pdf

Wempen, K. (2022, April 29). Are you getting too much protein? Retrieved from mayo clinic health system: https://www.mayoclinichealthsystem.org/hometown-health/speaking-of-health/are-you-getting-too-much-protein#:~:text=General%20recommendations%20are%20to%20consume,30%20grams%20at%20one%20time.